Importing the data
In this chapter, you’ll work with attendance data from public schools in the US, organized by school level and state, during the 2007-2008 academic year. The data contain information on average daily attendance (ADA) as a percentage of total enrollment, school day length, and school year length.
The data were given to you in an Excel spreadsheet, which you can download or view a screenshot of. Either way, take a moment to look at the spreadsheet.
Do you see any symptoms of untidy data? At first glance, it looks like the first row is a description of the data, the second row is a variable itself that groups multiple columns together, and the fourth row gives numbers for the columns, which might look nice in a spreadsheet but isn’t very useful for you, the analyst.
You’ll take it one step at a time to import the data using the gdata package. The name of this spreadsheet is "attendance.xls" and is available in your working directory.
# Load the gdata package
library(gdata)
prl <- "C:/myperl/perl/bin/perl5.28.1.exe"
# Import the spreadsheet: att
att <- read.xls("../xDatasets/attendance.xls", perl = prl)Whoops, looks like your call to `read.xls() triggered awarning`. You already got experience interpreting warning messages in previous chapters, so we’ll just go ahead and tell you: this one’s nothing to worry about.
Examining the data
Now that you’ve successfully imported the data into R, you can get a better idea of how the spreadsheet was read in.
# Print the column names
names(att)## [1] "Table.43..Average.daily.attendance..ADA..as.a.percentage.of.total.enrollment..school.day.length..and.school.year.length.in.public.schools..by.school.level.and.state..2007.08"
## [2] "X"
## [3] "X.1"
## [4] "X.2"
## [5] "X.3"
## [6] "X.4"
## [7] "X.5"
## [8] "X.6"
## [9] "X.7"
## [10] "X.8"
## [11] "X.9"
## [12] "X.10"
## [13] "X.11"
## [14] "X.12"
## [15] "X.13"
## [16] "X.14"
## [17] "X.15"
# Print the first 6 rows
att %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| Table.43..Average.daily.attendance..ADA..as.a.percentage.of.total.enrollment..school.day.length..and.school.year.length.in.public.schools..by.school.level.and.state..2007.08 | X | X.1 | X.2 | X.3 | X.4 | X.5 | X.6 | X.7 | X.8 | X.9 | X.10 | X.11 | X.12 | X.13 | X.14 | X.15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total elementary, secondary, and combined elementary/secondary schools | Elementary schools | Secondary schools | ||||||||||||||
| ADA as percent of enrollment | Average hours in school day | Average days in school year | Average hours in school year | ADA as percent of enrollment | Average hours in school day | ADA as percent of enrollment | Average hours in school day | |||||||||
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ||||||||
| United States …….. | 93.1 | (0.22) | 6.6 | (0.02) | 180 | (0.1) | 1,193 | (3.1) | 94.0 | (0.27) | 6.7 | (0.02) | 91.1 | (0.43) | 6.6 | (0.04) |
| Alabama …………….. | 93.8 | (1.24) | 7.0 | (0.07) | 180 | (0.8) | 1,267 | (12.3) | 93.8 | (1.84) | 7.0 | (0.08) | 94.6 | (0.38) | 7.1 | (0.17) |
| Alaska ……………… | 89.9 | (1.22) | 6.5 | (0.05) | 180 | (3.4) | 1,163 | (22.9) | 91.3 | (1.56) | 6.5 | (0.05) | 93.2 | (1.57) | 6.2 | (0.15) |
# Print the last 6 rows
att %>%
tail() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| Table.43..Average.daily.attendance..ADA..as.a.percentage.of.total.enrollment..school.day.length..and.school.year.length.in.public.schools..by.school.level.and.state..2007.08 | X | X.1 | X.2 | X.3 | X.4 | X.5 | X.6 | X.7 | X.8 | X.9 | X.10 | X.11 | X.12 | X.13 | X.14 | X.15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 54 | Wisconsin …………… | 95.0 | (0.57) | 6.9 | (0.04) | 180 | (0.7) | 1,246 | (8.6) | 95.4 | (0.41) | 6.9 | (0.05) | 93.0 | (1.91) | 7.0 | (0.14) |
| 55 | Wyoming …………….. | 92.4 | (1.15) | 6.9 | (0.05) | 175 | (1.3) | 1,201 | (8.3) | 92.2 | (1.65) | 6.9 | (0.05) | 92.4 | (0.75) | 7.0 | (0.07) |
| 56 | â Not applicable. | ||||||||||||||||
| 57 | â¡Reporting standards not met (too few cases). | ||||||||||||||||
| 58 | NOTE: Averages reflect data reported by schools rather than state requirements. School-reported length of day may exceed state requirements, and there is a range of statistical error in reported estimates. Standard errors appear in parentheses. | ||||||||||||||||
| 59 | SOURCE: U.S. Department of Education, National Center for Education Statistics, Schools and Staffing Survey (SASS), School Questionnaire | 2003-04 and 2007-08. (This table was prepared June 2011.) |
# Print the structure
str(att, give.attr = FALSE)## 'data.frame': 59 obs. of 17 variables:
## $ Table.43..Average.daily.attendance..ADA..as.a.percentage.of.total.enrollment..school.day.length..and.school.year.length.in.public.schools..by.school.level.and.state..2007.08: Factor w/ 58 levels ""," United States ........",..: 1 1 3 2 6 7 8 9 10 11 ...
## $ X : Factor w/ 42 levels "","\\ 2003-04 and 2007-08. (This table was prepared June 2011.)",..: 42 41 3 22 28 8 6 14 23 29 ...
## $ X.1 : Factor w/ 45 levels "","(0.22)","(0.23)",..: 1 1 1 2 22 21 41 27 14 6 ...
## $ X.2 : Factor w/ 14 levels "","3","6.2","6.3",..: 1 14 2 7 11 6 5 10 3 11 ...
## $ X.3 : Factor w/ 14 levels "","(0.02)","(0.03)",..: 1 1 1 2 7 5 9 6 7 5 ...
## $ X.4 : Factor w/ 15 levels "","171","172",..: 1 15 14 10 10 10 11 9 11 2 ...
## $ X.5 : Factor w/ 22 levels "","(0.0)","(0.1)",..: 1 1 1 3 10 21 19 4 6 12 ...
## $ X.6 : Factor w/ 48 levels "","1,102","1,117",..: 1 48 47 26 45 15 13 36 5 28 ...
## $ X.7 : Factor w/ 48 levels "","(10.1)","(10.3)",..: 1 1 1 36 11 35 22 6 13 48 ...
## $ X.8 : Factor w/ 40 levels "","6","81.0",..: 40 39 2 22 20 9 5 13 27 25 ...
## $ X.9 : Factor w/ 51 levels "","(0.24)","(0.25)",..: 1 1 1 4 32 25 48 35 20 13 ...
## $ X.10 : Factor w/ 14 levels "","6.2","6.3",..: 1 14 10 7 11 5 4 9 3 11 ...
## $ X.11 : Factor w/ 13 levels "","(0.02)","(0.03)",..: 1 1 1 2 8 5 10 8 5 7 ...
## $ X.12 : Factor w/ 41 levels "","8","85.8",..: 41 40 2 18 34 29 7 17 8 19 ...
## $ X.13 : Factor w/ 47 levels "","(0.35)","(0.37)",..: 1 1 1 6 4 24 42 34 22 19 ...
## $ X.14 : Factor w/ 17 levels "","5.9","6.1",..: 1 17 15 8 13 4 6 10 3 12 ...
## $ X.15 : Factor w/ 29 levels "","(0.03)","(0.04)",..: 1 1 1 3 14 12 20 7 17 8 ...
These are some messy data! The column names are mostly missing, there are irrelevant notes at the end of the data frame, and it looks like the numeric data were imported as factors. Let’s start the cleaning process!
Removing unnecessary rows
Again, for reference, here is an image of the first 22 rows of the original spreadsheet you were given. Looking at this image, you can see that rows 1, 4, 11, and 17 of the spreadsheet are useless. But is it safe to do something like the following?
att2 <- att[-c(1, 4, 11, 17), ]
No! From the last exercise, you might have realized that the read.xls() function actually imported the first row of the original data frame as the variable name for the first column. Did you notice that the first 6 rows of att aren’t the same as the first six rows you saw in the original spreadsheet? What about the 11th and 17th rows?
When you’re importing a messy spreadsheet into R, it’s good practice to compare the original spreadsheet with what you’ve imported. It turns out that, by default, the read.xls() function skips empty rows such as the 11th and 17th.
After viewing your data frame, you realize you still need to get rid of the third row of att, as well as rows 56 through 59.
# Create remove
remove <- c(3, 56, 57, 58, 59)
# Create att2
att2 <- att[-remove, ]Removing useless columns
Once more, for reference, here is an image of the first 22 rows of the original spreadsheet. You can see here that the columns 3, 5, 7, 9, 11, 13, 15, and 17 (or columns C, E, G, I, K, M, O, Q in Excel) don’t contain the values of average daily attendance (ADA).
# Create remove
remove <- c(3,5,7,9,11,13,15,17)
# Create att3
att3 <- att2[ ,-remove]Splitting the data
In many cases, a single data frame stores multiple “tables” of information. You can often diagnose this problem by looking at the column names and noticing duplicate rows.
In this data frame, columns 1, 6, and 7 represent attendance data for US elementary schools, columns 1, 8, and 9 represent data for secondary schools, and columns 1 through 5 represent data for all schools in the US.
Each of these should be stored as its own separate data frame, so you’ll split them up here.
# Subset just elementary schools: att_elem
att_elem <- att3[ , c(1,6,7)]
# Subset just secondary schools: att_sec
att_sec <- att3[ , c(1,8,9)]
# Subset all schools: att4
att4 <- att3[ , c(1:5)]From here on, you’ll just be working with the data pertaining to all schools in the US.
Replacing the names
Since you went through so much trouble finding out which row stored the variable names, you should store that row as the actual column names of the data frame. We’ve modified the names a bit in order to be more stylistically sound; they’re stored as cnames in the editor.
This will also allow you to remove the first two rows (currently storing variable names).
# Define cnames vector (don't change)
cnames <- c("state", "avg_attend_pct", "avg_hr_per_day",
"avg_day_per_yr", "avg_hr_per_yr")
# Assign column names of att4
colnames(att4) <- cnames
# Remove first two rows of att4: att5
att5 <- att4[-c(1,2), ]
# View the names of att5
names(att5)## [1] "state" "avg_attend_pct" "avg_hr_per_day" "avg_day_per_yr"
## [5] "avg_hr_per_yr"
Cleaning up extra characters
One of the most irritating things about this dataset is that the state names are all stored as the same number of characters, with periods padding the ends of the shorter states. That may be helpful for reading the spreadsheet, but it makes your life harder, so you’ll deal with it in this exercise.
One pitfall to avoid: . is a special character in the language of regular expressions (a.k.a. regex). In order to specify that you actually want to remove periods and not their regex equivalent (which is “all characters”), use \\.. This is called an "escape" sequence.
library(stringr)
# Remove all periods in state column
att5$state <- str_replace_all(att5$state, "\\.", "")
# Remove white space around state names
att5$state <- str_trim(att5$state)
# View the head of att5
att5 %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| state | avg_attend_pct | avg_hr_per_day | avg_day_per_yr | avg_hr_per_yr | |
|---|---|---|---|---|---|
| 4 | United States | 93.1 | 6.6 | 180 | 1,193 |
| 5 | Alabama | 93.8 | 7.0 | 180 | 1,267 |
| 6 | Alaska | 89.9 | 6.5 | 180 | 1,163 |
| 7 | Arizona | 89.0 | 6.4 | 181 | 1,159 |
| 8 | Arkansas | 91.8 | 6.9 | 179 | 1,229 |
| 9 | California | 93.2 | 6.2 | 181 | 1,129 |
Some final type conversions
Finally, you’ll convert the values in certain variables to numerics (instead of factors). It’s worth noting that in previous chapters, your numerical data has often come in as character strings. This is just a difference between read.xls() and the other import functions you’ve used.
The dplyr package offers an efficient method for applying a function to many columns at once. If you’d like to learn more, check out DataCamp’s course on Data Manipulation in R with dplyr! Since you might not have taken the dplyr course yet, we’re just showing you the code here. You’ll do the same thing, but using subsetting and sapply() instead.
# Change columns to numeric using dplyr (don't change)
library(dplyr)
example <- mutate_at(att5, vars(-state), funs(as.numeric))## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
##
## # Before:
## funs(name = f(.)
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
# Define vector containing numerical columns: cols
cols <- -1
# Use sapply to coerce cols to numeric
att5[, cols] <- sapply(att5[, cols], as.numeric)